White wine quality analysis

This document is composed of 5 sections.

General information about dataset

Please navigate using the tabs to see the different contents.

Dataset content

This tidy data set contains 4,898 white wines with 11 variables quantifying the chemical properties of each wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).

The question we would like to answer is: which chemical properties influence the quality of white wines?

Dataset structure
## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

Our dataset consists of 13 variables. The X variable is only a row identifier and will not be considered in the rest of this analysis. It means we have 12 meaningful variables (11 chemical properties plus the quality score). The dataset is composed of 4,898 observations.
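As an illustration, dropping the row identifier could look like the sketch below. The toy data frame only reuses a few values from the structure shown above, and the name `wines` is an assumption (the original code is not shown in this report).

```r
# Toy data frame standing in for the wine data (name `wines` is hypothetical;
# values are copied from the first rows of the structure listing)
wines <- data.frame(
  X             = 1:3,
  fixed.acidity = c(7.0, 6.3, 8.1),
  alcohol       = c(8.8, 9.5, 10.1),
  quality       = c(6L, 6L, 6L)
)

# Drop the row identifier so that only meaningful variables remain
wines$X <- NULL

str(wines)
```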

Dataset sample rows
##    fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1            7.0             0.27        0.36           20.7     0.045
## 2            6.3             0.30        0.34            1.6     0.049
## 3            8.1             0.28        0.40            6.9     0.050
## 4            7.2             0.23        0.32            8.5     0.058
## 5            7.2             0.23        0.32            8.5     0.058
## 6            8.1             0.28        0.40            6.9     0.050
## 7            6.2             0.32        0.16            7.0     0.045
## 8            7.0             0.27        0.36           20.7     0.045
## 9            6.3             0.30        0.34            1.6     0.049
## 10           8.1             0.22        0.43            1.5     0.044
##    free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                   45                  170  1.0010 3.00      0.45     8.8
## 2                   14                  132  0.9940 3.30      0.49     9.5
## 3                   30                   97  0.9951 3.26      0.44    10.1
## 4                   47                  186  0.9956 3.19      0.40     9.9
## 5                   47                  186  0.9956 3.19      0.40     9.9
## 6                   30                   97  0.9951 3.26      0.44    10.1
## 7                   30                  136  0.9949 3.18      0.47     9.6
## 8                   45                  170  1.0010 3.00      0.45     8.8
## 9                   14                  132  0.9940 3.30      0.49     9.5
## 10                  28                  129  0.9938 3.22      0.45    11.0
##    quality
## 1        6
## 2        6
## 3        6
## 4        6
## 5        6
## 6        6
## 7        6
## 8        6
## 9        6
## 10       6
Dataset descriptive statistics
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.878  
##  3rd Qu.:6.000  
##  Max.   :9.000

Variable analysis

This section shows several distribution charts for each variable. Please use the tabs to navigate from one analysis to the other.

The orange chart is based on the dataset values as they are; the red vertical line shows the 95% quantile threshold. The blue chart is based on the dataset values without upper outliers, which are identified using the interquartile range (IQR) method. Where relevant, the grey chart shows the data without outliers on a log10 scale. Associated descriptive statistics are provided where relevant.
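The upper-outlier rule described above can be sketched as follows. This is a minimal sketch assuming the usual fence of Q3 + 1.5 × IQR; the function names and the toy vector are hypothetical and are not taken from the original code.

```r
# Upper fence of the interquartile method: Q3 + k * IQR (k = 1.5 is assumed)
upper_fence <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  q[2] + k * (q[2] - q[1])
}

# Keep only the values at or below the upper fence
without_upper_outliers <- function(x, k = 1.5) {
  x[x <= upper_fence(x, k)]
}

# Toy vector (hypothetical values, not taken from the dataset)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 100)

upper_fence(x)              # threshold above which values count as outliers
without_upper_outliers(x)   # data used for the "blue chart" style of plot
quantile(x, probs = 0.95)   # 95% quantile, as marked by the red vertical line
```

With the quartiles reported for residual sugar (1.7 and 9.9), this rule gives 9.9 + 1.5 × 8.2 = 22.2 g/dm^3, close to the 22.5 g/dm^3 threshold read from the chart.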

Fixed acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

Fixed acidity distribution is rather normal with an average value of 6.855 g/dm^3. We see there are some outliers with values beyond 10 g/dm^3.

Volatile acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

Volatile acidity distribution is rather normal with an average value of 0.2782 g/dm^3. We see there are some outliers with values greater than 0.5 g/dm^3.

Citric acid

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

The citric acid distribution is rather normal with an average value of 0.3342 g/dm^3. We see there are some outliers with values above 0.6 g/dm^3.

We see a peak just below 0.5 g/dm^3. As it is just below 0.5, it could be interesting to understand how the associated measurements were made and whether we are hitting a limit of the measurement system in that case.

Residual sugar

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

The residual sugar distribution is skewed with an average value of 6.391 g/dm^3. There are some outliers with values above 22.5 g/dm^3.

When looking at the log10-transformed distribution, we see a distribution that is at least bimodal. When looking at sugar values above 5 g/dm^3, we can also say we have a multimodal distribution.

It could be interesting to separate the wines having a residual sugar value below 5 g/dm^3 from those having a value above 5 g/dm^3.
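A possible way to split the wines around the 5 g/dm^3 threshold is sketched below. The vector reuses the residual sugar values of the ten sample rows shown earlier; the variable names and labels are assumptions made for illustration.

```r
# Residual sugar (g/dm^3) of the ten sample rows shown earlier
residual.sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7.0, 20.7, 1.6, 1.5)

# Two groups: up to 5 g/dm^3 (inclusive) and above 5 g/dm^3
sugar.group <- cut(residual.sugar,
                   breaks = c(-Inf, 5, Inf),
                   labels = c("<= 5 g/dm^3", "> 5 g/dm^3"))

# Number of wines in each group
table(sugar.group)
```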

Chlorides

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

Chlorides distribution is rather normal with an average value of 0.04577 g/dm^3. There are some outliers with values above 0.07 g/dm^3.

Free sulfur dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

Free sulfur dioxide distribution is rather normal with an average value of 35.31 mg/dm^3. There are some outliers with values above 80 mg/dm^3.

Total sulfur dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

Total sulfur dioxide distribution is rather normal with an average value of 138.4 mg/dm^3. There are some outliers with values above roughly 255 mg/dm^3 (upper fence: 167 + 1.5 × 59 = 255.5).

Density

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

Wine density distribution is rather normal with an average value of 0.9940 g/cm^3. There are some outliers with values above 1.0025 g/cm^3.

pH

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

pH distribution is rather normal with an average value of 3.188. There are some outliers with values above 3.55.

It means that Vinho Verde is a fairly acidic wine. This is consistent with the acidity of grapefruit juice (see https://en.wikipedia.org/wiki/PH#/media/File:216_pH_Scale-01.jpg).

Reminder: the pH scale goes from 0 to 14. A neutral pH is 7. Values below 7 indicate acidity; values above 7 indicate basicity.

Sulphates

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

Sulphates distribution is rather normal with an average value of 0.4898 g/dm^3. There are some outliers with values above 0.76 g/dm^3.

Alcohol

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

The alcohol distribution is skewed, with an average alcohol level of 10.51% vol. This is a fairly light wine (on average, wines have a 12% to 14% alcohol level). There are no real outliers when using the interquartile method.

The log10 transformation does not reveal any multimodal distribution.

Quality

We can identify 3 different quality groups: the low one, with scores up to 4; the medium one, with scores of 5, 6 or 7; and the good one, with scores of 8 or 9.

Most of the wines are considered to be of medium quality. We can see we have very few wines considered as bad (quality = 3) and even fewer rated very good (quality = 9). We have no “crappy” wines (quality = 0 or 1) nor outstanding wines (quality = 10).
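The three quality groups described above could be encoded as in this sketch. The group labels and variable names are assumptions; only the score boundaries come from the text.

```r
# One example of each quality score observed in the dataset (3 to 9)
quality <- c(3L, 4L, 5L, 6L, 7L, 8L, 9L)

# low: up to 4, medium: 5 to 7, good: 8 and above
quality.group <- cut(quality,
                     breaks = c(-Inf, 4, 7, Inf),
                     labels = c("low", "medium", "good"))

# Cross-tabulate scores against groups to check the mapping
table(quality, quality.group)
```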

Data cleaning

From the dataset, we will exclude all rows having at least one value considered as an outlier (based on the interquartile method).

## [1] "Initial number of rows: 4898"
## [1] "Number of rows in cleaned dataset: 4074"
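A minimal sketch of this kind of row-wise cleaning is given below. It assumes that a value is an outlier when it falls outside [Q1 - 1.5 × IQR, Q3 + 1.5 × IQR]; whether the original code used both fences or only the upper one is not stated, and the function names and toy data are hypothetical.

```r
# TRUE where a value lies outside [Q1 - k * IQR, Q3 + k * IQR] (assumed rule)
is_outlier <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  x < q[1] - k * iqr | x > q[2] + k * iqr
}

# Drop every row that has an outlier in at least one of the given columns
remove_outlier_rows <- function(df, columns = names(df)) {
  any.outlier <- Reduce("|", lapply(df[columns], is_outlier))
  df[!any.outlier, , drop = FALSE]
}

# Toy data (hypothetical values): only the last row contains an outlier
toy <- data.frame(
  a = c(1, 2, 3, 4, 5, 6, 7, 8, 100),
  b = c(5, 5, 5, 5, 5, 5, 5, 5, 5)
)
cleaned <- remove_outlier_rows(toy)

print(paste("Initial number of rows:", nrow(toy)))
print(paste("Number of rows in cleaned dataset:", nrow(cleaned)))
```

Applying the rule to all columns at once is strict: a row is lost as soon as one of its values is extreme, which is consistent with the drop from 4898 to 4074 rows reported above.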

Correlation analysis

Please navigate using the tabs to see the two different correlation plots.

Correlation heatmap

Quality seems to be negatively correlated with density and positively correlated with alcohol. But the quality-related correlation coefficients are pretty low (in absolute value)!

Correlation matrix plot

The highest correlations are:

  • density and alcohol
  • residual sugar and density

Quality is partially correlated with alcohol, density and chlorides.

Nevertheless, we see that the point clouds are often dispersed. This explains why the correlation factors are often less than 0.5 (in absolute value).

In the rest of this document, only absolute values of correlation factors will be mentioned.
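For reference, a correlation matrix reported in absolute values can be computed as in the sketch below. The data are synthetic (generated only so that the example runs on its own) and Pearson correlation is an assumption, as the report does not state which method was used.

```r
set.seed(42)
n <- 200

# Synthetic stand-ins for three of the variables (NOT the real measurements)
alcohol        <- rnorm(n, mean = 10.5, sd = 1.2)
residual.sugar <- rexp(n, rate = 1 / 6)
density        <- 0.994 - 0.002 * (alcohol - 10.5) +
                  0.0004 * residual.sugar + rnorm(n, sd = 0.0005)

toy <- data.frame(alcohol, residual.sugar, density)

# Signed Pearson correlation matrix
corr <- cor(toy, method = "pearson")

# Absolute values, rounded, as mentioned in the rest of the document
round(abs(corr), 2)
```

Keeping the signed matrix `corr` alongside the absolute values helps to remember the direction of each relationship, which the absolute value hides.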

Variable correlation analysis

In this section, we will analyse the correlation between quality and each of the other variables. Please use the tabs to navigate from one analysis to the other.

Quality vs Fixed Acidity

## [1] "Correlation factor:  0.0524192864513668"

Correlation factor is 0.05. Whatever the quality is, the average values are pretty much the same (between 6.6 and 7 g/dm^3). We see a very high dispersion of values, whatever the quality is.

We cannot identify any correlation pattern between fixed acidity and quality.
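The two quantities discussed in each of these tabs (the absolute correlation factor and the average value per quality score) could be obtained as in the following sketch. The toy values and variable names are hypothetical and do not reproduce the figures above.

```r
# Toy data (hypothetical values, not the real dataset)
toy <- data.frame(
  fixed.acidity = c(7.0, 6.3, 8.1, 7.2, 6.2, 7.4),
  quality       = c(5L, 6L, 6L, 7L, 5L, 8L)
)

# Absolute value of the correlation between the variable and quality
corr.factor <- abs(cor(toy$fixed.acidity, toy$quality))
print(paste("Correlation factor: ", corr.factor))

# Mean and dispersion of the variable for each quality score
mean.by.quality <- tapply(toy$fixed.acidity, toy$quality, mean)
sd.by.quality   <- tapply(toy$fixed.acidity, toy$quality, sd)  # NA for single-wine groups

mean.by.quality
sd.by.quality
```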

Quality vs Volatile Acidity

## [1] "Correlation factor:  0.117125850001902"

Correlation factor is 0.12. Whatever the quality is, the average values are pretty much the same (between 0.25 and 0.32 g/dm^3). We see a very high dispersion of values, whatever the quality is.

We cannot identify any correlation pattern between volatile acidity and quality.

Quality vs Citric Acid

## [1] "Correlation factor:  0.0358270876981967"

Correlation factor is 0.04. If we exclude the low quality wines, we see that quality tends to increase as citric acid increases.

Whatever the quality is, the average values are pretty much the same (between 0.3 and 0.35 g/dm^3).

We cannot identify any correlation pattern between citric acid and quality.

Quality vs Residual sugar

## [1] "Correlation factor:  0.10565960642498"

Correlation factor is 0.11. We can see that very good quality wines have a low residual sugar quantity. The quality of medium and good wines tends to increase as residual sugar decreases. This is logical, as Vinho Verde is supposed to be a dry wine.

The dispersion of values and the differences between mean values across wine qualities are much bigger here. Nevertheless, we cannot identify any correlation pattern between residual sugar and quality.

Quality vs Chlorides

## [1] "Correlation factor:  0.279538296180308"

Correlation factor is 0.28. This is one of the highest correlation factors.

We can see that for low quality wines, quality increases when the chlorides quantity is higher. For medium and good quality wines, we have the opposite trend.

Nevertheless, due to the high dispersion of values, we cannot identify any correlation pattern between chlorides and quality.

Quality vs Free Sulfur Dioxide

## [1] "Correlation factor:  0.0170360242390806"

Correlation factor is 0.02. This is very low.

Here, we can see an interesting piece of information. Low quality wines tend to have a low free sulfur dioxide figure. Sulfur dioxide is used in wine to avoid oxidation and other chemical reactions. We can suppose that these kinds of reactions occurred in wines having low free sulfur dioxide figures. Nevertheless, we can see that free sulfur dioxide figures are very dispersed whatever the wine quality is. Therefore, we cannot identify any correlation pattern between free sulfur dioxide and quality.

Quality vs Total Sulfur Dioxide

## [1] "Correlation factor:  0.165039356154331"

Correlation factor between these two variables is pretty low (0.17).

We can see a distinction between low quality wines and medium/good ones.

For good/medium quality wines, the lower the total sulfur dioxide is, the better the quality is. For low quality wines, the mean total sulfur dioxide values are the lowest ones (about 110 mg/dm^3).

Nevertheless, the high data dispersion does not allow us to draw any conclusion about correlation.

Quality vs Density

## [1] "Correlation factor:  0.298326809705554"

The correlation factor between density and quality is one of the highest, with a value of 0.30. For good/medium quality wines, the lower the density is, the better the quality is. As density and residual sugar are highly correlated, this behaviour is not a surprise.

Nevertheless, the high data dispersion does not allow us to draw any conclusion about correlation.

Quality vs pH

## [1] "Correlation factor:  0.0791526348942642"

Correlation coefficient between quality and pH is 0.08.

pH values are very similar whatever the wine quality is. There is also a high dispersion of values for all wine qualities. It does not allow us to draw any conclusion about correlation.

Quality vs Sulphates

## [1] "Correlation factor:  0.0181793067982046"

Here also, the correlation factor is pretty low, with a value of 0.02. Average values are very close whatever the quality is. There is also a high dispersion of values for all wine qualities. It does not allow us to draw any conclusion about correlation.

Quality vs Alcohol

## [1] "Correlation factor:  0.422835777947479"

Correlation factor between alcohol and quality is one of the highest with a value of 0.42.

For medium and good wines, we see that quality increases with higher alcohol levels. This is consistent with the residual sugar values: when the sugar value decreases, it means the sugar has been transformed into alcohol. This correlation is definitely not a surprise.

In every correlation analysis, due to the high dispersion of values, no possible correlation can really be assessed.

Other explored relationships

In this section, we explore the two highest correlations between variables. Please use the tabs to navigate from one chart to another.

Alcohol vs Density

We can see there is a correlation between wine density and alcohol.

Residual Sugar vs Density

We can also see a correlation trend between density and residual sugar. We also observe a high dispersion of density for low residual sugar values.

From the two charts above, we can see that the more residual sugar there is, the higher the density is and the lower the alcohol is. This outcome is not a surprise, as Vinho Verde white wine is supposed to be a dry wine.

Multivariate analysis

In this section, we will analyse the correlation of alcohol, density and chlorides with quality. We will analyse each pair of these variables against quality, which gives 3 different charts. With these charts, we will try to identify specific quality clusters and trends depending on the two variables considered.

Please use the tabs to navigate from one chart to another.

Alcohol vs Density and Quality

Chlorides vs Density and Quality

Chlorides vs Alcohol and Quality

On these 3 charts, we cannot see any data cluster popping out. In addition, we see that the various trend lines are pretty similar. We conclude that we are not able to build a model based on two variables to estimate the wine quality.

We should create a model taking more variables into account to try to estimate the wine quality. Nevertheless, we will explain in the Reflection section why we are not confident in such an approach.
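To make the idea of a model with more variables concrete, a minimal sketch using an ordinary linear regression is shown below. This model was not built in this analysis: the data are synthetic, the coefficients are arbitrary, and treating the quality score as a continuous response is a simplifying assumption.

```r
set.seed(1)
n <- 300

# Synthetic stand-ins for three predictors (NOT the real measurements)
toy <- data.frame(
  alcohol   = rnorm(n, mean = 10.5,  sd = 1.2),
  density   = rnorm(n, mean = 0.994, sd = 0.003),
  chlorides = rnorm(n, mean = 0.045, sd = 0.01)
)

# Synthetic quality score: weak link with alcohol plus a lot of noise
toy$quality <- 5.9 + 0.3 * (toy$alcohol - 10.5) + rnorm(n, sd = 0.8)

# Linear model using several variables at once
fit <- lm(quality ~ alcohol + density + chlorides, data = toy)

coef(fit)               # estimated effect of each variable
summary(fit)$r.squared  # share of the quality variance explained by the model
```

On data as dispersed as ours, we would expect the share of explained variance to stay modest, which is in line with the low correlation factors observed above.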

Final Plots and Summary

This section shows 3 different charts and the main outcome for each of them.

In this chart, we see two groups of wines: the ones having more than 5 g/dm^3 of residual sugar and the other ones. As Vinho Verde is a dry wine, we expect it to have a low residual sugar value.

In this chart, we can see that some wine growers do not fully master the wine making process. They did not add enough sulfur dioxide to prevent the wine from oxidising. It results in a poor quality wine.

In this chart, we cannot see any wine quality cluster popping up. The trend lines are pretty much the same (except the one for the highest quality (9), which is not really representative due to the low number of measurements).

Reflection

Global reflection about the outcome of this study

This analysis of wine chemistry variables is really interesting. The main expected outcomes can be seen through the data analysis (link between residual sugar, density and alcohol). We can also point out some issues in the wine making process by identifying that low sulfur dioxide values lead to poor wine quality.

For quality prediction, we see that no easy correlation can be found. The maximum correlation factors are low (the highest one is less than 0.5).

We can ask ourselves about the dataset itself. Are we missing some valuable information to better assess the wine quality? For instance, we do not know the wine vintage. We also do not know the grape variety used for each of the tested wines, nor the quality of the land on which the grapes grew.

In addition, if wine quality could be determined only by its chemical factors, we could easily create artificial wines. This is not the case. I could only find one startup aiming to create artificial wine: Ava Winery (http://www.avawinery.com). This fake wine does not seem to be really good when tasted (http://www.ouest-france.fr/leditiondusoir/data/747/reader/reader.html#!preferred/1/package/747/pub/748/page/5 (French article)).

See also https://www.newscientist.com/article/2088322-synthetic-wine-made-without-grapes-claims-to-mimic-fine-vintages/ and http://www.radionz.co.nz/national/programmes/thiswayup/audio/201807662/lab-made-wine

Nevertheless, artificial wines have not been allowed to be sold in France since 1905 due to anti-fraud law.

Under these conditions, I think nothing will be better than actually tasting a wine to find out whether you like it or not.

Personal reflection about this analysis

This wine analysis is very interesting. The dataset was already of very good quality, with a significant number of observations to be representative. Even after removing the outliers, we still have enough observations. The dataset was tidy, which makes all computations pretty fast.

Except for the quality field, the dataset contains only continuous variables. This helps to factorise the R code.

P4 lessons and the project sample were useful to build this analysis, in terms of techniques (R principles and syntax) but also in terms of analysis organisation.

I had some trouble making R Markdown work, especially with the list items. I struggled to get the layout I wanted, but I finally succeeded.

I found interesting graphical data representations using Google (especially the correlation matrix).

I did not respect the R coding rules at the beginning. I modified the code for the second submission. I find some coding recommendations useful, but others seem to come from another age, especially the 80-character limit, as we have not used VT terminals for ages.

If we focus on the analysis itself, we were asked to answer a specific question. Within my analysis, I could not really answer this question. I was able to identify one item explaining why you get a bad wine, but could not find items explaining why you get a good wine. In this analysis, I only explored univariate and bivariate analyses. It means that there are a lot of possibilities I did not explore (from 4 variables up to 12 variables).

To be rigorous and to assert that we cannot find a correlation between wine parameters and quality, I should have done all the possible analyses. It would have been a time-consuming activity. The literature reinforced my belief that I would not find any reason fully explaining the wine quality. Nevertheless, this is only an assumption that I did not really demonstrate. It would have been much easier to find a correlation between parameters explaining the wine quality. In our case, we need to be very careful when writing our conclusion in order to make it indisputable.